LTX-2.3 A2V & retake video and audio #1346
Conversation
Pull request overview
Adds LTX-2.3 audio-to-video (A2V) and video/audio “retake” (region-based regeneration) support, along with runnable examples and documentation links for the new workflows.
Changes:
- Add `torchaudio` as a dependency and introduce `read_audio_with_torchaudio` (+ resampling helper) in LTX2 media I/O.
- Extend `LTX2AudioVideoPipeline` to support `retake_video*` and `retake_audio*` inputs via new pipeline units and inpaint-mask handling for both video and audio latents.
- Add new LTX-2.3 TwoStage example scripts (normal + low-VRAM) and document them in README/docs tables.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds torchaudio dependency needed by new/existing audio functionality. |
| diffsynth/utils/data/media_io_ltx2.py | Adds torchaudio-based audio loading + optional resampling utilities. |
| diffsynth/pipelines/ltx2_audio_video.py | Implements audio/video retake embedding + mask-driven denoising for A2V/retake. |
| examples/ltx2/model_inference/LTX-2.3-A2V-TwoStage.py | New A2V TwoStage example. |
| examples/ltx2/model_inference/LTX-2.3-T2AV-TwoStage-Retake.py | New TwoStage retake example (video + audio regions). |
| examples/ltx2/model_inference_low_vram/LTX-2.3-A2V-TwoStage.py | Low-VRAM variant of A2V TwoStage example. |
| examples/ltx2/model_inference_low_vram/LTX-2.3-T2AV-TwoStage-Retake.py | Low-VRAM variant of retake example. |
| docs/en/Model_Details/LTX-2.md | Adds table entries linking to A2V/Retake examples. |
| docs/zh/Model_Details/LTX-2.md | Adds table entries linking to A2V/Retake examples. |
| README.md | Adds table entries linking to A2V/Retake examples. |
| README_zh.md | Adds table entries linking to A2V/Retake examples. |
```python
def __init__(self):
    super().__init__(
        input_params=("retake_audio", "seed", "rand_device", "retake_audio_regions"),
        output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio", "audio_latents"),
    )
```
LTX2AudioVideoUnit_AudioRetakeEmbedder declares output_params including denoise_mask_audio and audio_latents, but process() never returns audio_latents (and only conditionally returns denoise_mask_audio when retake_audio is provided). Since output_params are used to build the pipeline dependency/update graph, this mismatch can lead to incorrect unit splitting/order analysis during training/data-processing. Align output_params with the actual returned keys (or return audio_latents if it is intended to be set here).
Suggested change:
```diff
-            output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio", "audio_latents"),
+            output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio"),
```
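The general failure mode here can be checked without the pipeline itself. The sketch below is a hypothetical stand-in (not the actual DiffSynth code): it mirrors the reported behavior, where `process()` never produces `"audio_latents"` and only produces `"denoise_mask_audio"` conditionally, and shows how comparing declared outputs against actual returned keys surfaces the mismatch:

```python
# Hypothetical stand-in for the unit's declaration; values are placeholders.
output_params = (
    "input_latents_audio", "audio_noise", "audio_positions",
    "audio_latent_shape", "denoise_mask_audio", "audio_latents",
)

def process(retake_audio=None):
    # Mirrors the reported behavior: "audio_latents" is never returned,
    # and "denoise_mask_audio" appears only when retake_audio is given.
    out = {
        "input_latents_audio": ...,
        "audio_noise": ...,
        "audio_positions": ...,
        "audio_latent_shape": ...,
    }
    if retake_audio is not None:
        out["denoise_mask_audio"] = ...
    return out

# Declared-but-never-produced outputs, even in the most permissive call:
missing = set(output_params) - set(process(retake_audio=object()).keys())
print(missing)  # {'audio_latents'}
```

A check like this could live in a unit test so that any drift between `output_params` and the keys `process()` actually returns is caught before it corrupts the dependency-graph analysis.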
```python
def read_audio_with_torchaudio(path: str, start_time: float = 0, duration: float | None = None, resample: bool = False, resample_rate: int = 48000) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(path, channels_first=True)
    if resample:
        waveform = resample_waveform(waveform, sample_rate, resample_rate)
        sample_rate = resample_rate
    start_frame = int(start_time * sample_rate)
    if start_frame > waveform.shape[-1]:
        raise ValueError(f"start_time of {start_time} exceeds max duration of {waveform.shape[-1] / sample_rate:.2f}")
    end_frame = -1 if duration is None else int(duration * sample_rate + start_frame)
```
read_audio_with_torchaudio is annotated as returning torch.Tensor, but it actually returns a (waveform, sample_rate) tuple. Also, when duration is None the code sets end_frame = -1, which slices off the final sample (and changes the full-length audio by default). Consider updating the return type annotation to a tuple and using end_frame = None (or omitting the end index) when duration is not provided; also consider treating start_frame == waveform.shape[-1] as out-of-range if empty audio is not intended.
Suggested change:
```diff
-def read_audio_with_torchaudio(path: str, start_time: float = 0, duration: float | None = None, resample: bool = False, resample_rate: int = 48000) -> torch.Tensor:
-    waveform, sample_rate = torchaudio.load(path, channels_first=True)
-    if resample:
-        waveform = resample_waveform(waveform, sample_rate, resample_rate)
-        sample_rate = resample_rate
-    start_frame = int(start_time * sample_rate)
-    if start_frame > waveform.shape[-1]:
-        raise ValueError(f"start_time of {start_time} exceeds max duration of {waveform.shape[-1] / sample_rate:.2f}")
-    end_frame = -1 if duration is None else int(duration * sample_rate + start_frame)
+def read_audio_with_torchaudio(
+    path: str,
+    start_time: float = 0,
+    duration: float | None = None,
+    resample: bool = False,
+    resample_rate: int = 48000,
+) -> tuple[torch.Tensor, int]:
+    waveform, sample_rate = torchaudio.load(path, channels_first=True)
+    if resample:
+        waveform = resample_waveform(waveform, sample_rate, resample_rate)
+        sample_rate = resample_rate
+    start_frame = int(start_time * sample_rate)
+    if start_frame >= waveform.shape[-1]:
+        raise ValueError(f"start_time of {start_time} exceeds max duration of {waveform.shape[-1] / sample_rate:.2f}")
+    end_frame = None if duration is None else int(duration * sample_rate + start_frame)
```
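The off-by-one behind this suggestion is plain Python slicing, easy to verify with a stand-in list in place of the real waveform tensor (the same semantics apply to `torch.Tensor` indexing):

```python
waveform = list(range(10))  # stand-in for a 10-sample mono waveform
start_frame = 2

# Original behavior when duration is None: end index -1 drops the last sample.
print(len(waveform[start_frame:-1]))    # 7 (indices 2..8)

# Fixed behavior: end index None slices through to the end of the sequence.
print(len(waveform[start_frame:None]))  # 8 (indices 2..9)
```

So with the default `duration=None`, the original code silently shortens every loaded clip by one sample; `None` (or simply omitting the end index) preserves the full tail.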
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the capabilities of the LTX-2.3 audio-video generation pipeline by integrating new audio-to-video generation and advanced retake features for both video and audio. These additions provide users with greater control and flexibility in generating and refining multimedia content, enhancing the overall functionality and usability of the system. The changes are supported by updated documentation and practical examples.
Code Review
This pull request introduces audio-to-video (A2V) and video/audio retake functionalities for the LTX-2.3 model. The changes include adding new pipeline units for handling retake video and audio, updating the main pipeline to accept new parameters, and adding corresponding example scripts. The implementation looks mostly correct, but I've found a critical issue in the pipeline unit definition that could break execution, a bug in audio slicing logic, and a minor type hint mismatch. My review includes suggestions to fix these issues.
```python
def __init__(self):
    super().__init__(
        input_params=("retake_audio", "seed", "rand_device", "retake_audio_regions"),
        output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio"),
    )
```
The output_params for this PipelineUnit includes "audio_latents", which is inconsistent with LTX2AudioVideoUnit_VideoRetakeEmbedder and requires the process method to return it (which it currently doesn't). The LTX2AudioVideoUnit_InputAudioEmbedder unit, which runs later in the pipeline, is responsible for setting audio_latents. To maintain consistency and ensure correct pipeline flow, "audio_latents" should be removed from output_params here.
Suggested change:
```diff
-            output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio", "audio_latents"),
+            output_params=("input_latents_audio", "audio_noise", "audio_positions", "audio_latent_shape", "denoise_mask_audio"),
```
```python
    resample: bool = False,
    resample_rate: int = 48000,
) -> tuple[torch.Tensor, int]:
    waveform, sample_rate = torchaudio.load(path, channels_first=True)
```
When duration is None, end_frame is set to -1. In Python slicing, using -1 as the end index excludes the last element of the tensor. To slice until the very end of the tensor, None should be used instead. This will prevent unintentionally dropping the last audio sample.
Suggested change:
```diff
-    end_frame = -1 if duration is None else int(duration * sample_rate + start_frame)
+    end_frame = None if duration is None else int(duration * sample_rate + start_frame)
```
```python
    resampled = torchaudio.functional.resample(waveform, source_rate, target_rate)
    return resampled.to(dtype=waveform.dtype)
```
The return type hint for this function is -> torch.Tensor, but it actually returns a tuple (waveform, sample_rate). The type hint should be updated to -> tuple[torch.Tensor, int] to accurately reflect the function's output.
Suggested change:
```python
def read_audio_with_torchaudio(path: str, start_time: float = 0, duration: float | None = None, resample: bool = False, resample_rate: int = 48000) -> tuple[torch.Tensor, int]:
```
No description provided.